LLMs Playing and Commentating on Go: Current State (2025)

by Peter de Blanc + ChatGPT Deep Research
Posted to Adarie (www.adarie.com) on March 31, 2025
Content License: Creative Commons CC0 (No Rights Reserved)


Published Evaluations

Academic research has begun examining how general-purpose large language models (LLMs) handle strategy games like Go. One 2024 study introduced a hybrid approach combining LLMs with Monte Carlo Tree Search (MCTS) for chess and Go, using the LLM as a move-selector and value estimator ([2403.05632] Can Large Language Models Play Games? A Case Study of A Self-Play Approach). Their experiments showed that a raw LLM alone performs very poorly at these games, but integrating it with self-play search improved results (though still nowhere near dedicated Go engines). Another line of research has looked at commentary generation in games. For example, Kim et al. (2024) focus on chess commentary and note a general pattern: expert game-playing models (like AlphaGo) make strong decisions but can’t explain them, while LLMs produce fluent natural-language commentary yet tend to hallucinate or mis-evaluate because they lack actual decision-making skill. This reflects the broader finding that LLMs are not reliable “players,” but they can verbally describe game situations – a gap that researchers are attempting to bridge through fine-tuning and new frameworks.

Notably, there is no standard benchmark for GPT-4 or similar LLMs on Go-playing strength published in literature. The capability of LLMs in Go has mostly been explored in ad-hoc experiments rather than formal benchmarks. The consensus from early evaluations is that without special assistance or training, general LLMs have trivial playing strength in Go. (By contrast, in chess – a much lower-complexity game – GPT-4 was observed to play at only amateur level, and even that required careful prompting or fine-tuning (I played chess against ChatGPT-4 and lost | Hacker News).) In Go’s case, the enormous search space and need for precise spatial reasoning make it an even tougher challenge for a language model that hasn’t been explicitly trained for it.

Informal Experiments and Anecdotes

Enthusiasts and community members have run numerous informal tests to gauge how well models like GPT-4 or Claude can play Go and comment on positions. The overwhelming finding is that LLMs struggle mightily with actual gameplay. For example, soon after GPT-4’s release (2023), users on forums and Reddit tried playing Go against it. One user described GPT-4 becoming “hopelessly lost” on a standard 19×19 board – after only a few moves it would start forgetting the board state, resulting in illegal or nonsensical moves (I played chess against ChatGPT-4 and lost | Hacker News). Even initiating a game required careful prompting (since ChatGPT might refuse a direct “let’s play Go” request). With clever prompting, GPT-4 would output moves in coordinate notation and even attempt commentary on each move. However, the game would fall apart quickly as the model lost track of which stones were where. As one experimenter put it: on 19×19 it failed after ~5 moves, and even on a small 9×9 board it lost track by move 7 and started producing invalid moves (I played chess against ChatGPT-4 and lost | Hacker News).
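The harnesses these experimenters describe share a common shape: re-send the entire move history every turn (since the model keeps no board of its own) and accept only strict coordinate replies. The sketch below illustrates that shape; the prompt wording and the idea of a driver calling a hypothetical `ask_llm()` are my own assumptions, not any specific user's setup.

```python
import re

# Go columns conventionally skip the letter "I".
COLS = "ABCDEFGHJKLMNOPQRST"

def parse_move(text, size=19):
    """Extract a coordinate like 'D4' or 'Q16' from a free-form LLM reply.
    Returns 0-based (col, row), or None if nothing parses -- replies often
    contain chatter or a resignation instead of a clean coordinate."""
    m = re.search(r"\b([A-HJ-T])(\d{1,2})\b", text.upper())
    if m is None:
        return None
    col = COLS.index(m.group(1))
    row = int(m.group(2)) - 1
    if col >= size or not 0 <= row < size:
        return None
    return (col, row)

def build_prompt(moves, size=9):
    """Re-send the full game record each turn, since the model has no
    persistent board state between replies."""
    history = ", ".join(f"{'BW'[i % 2]} {mv}" for i, mv in enumerate(moves))
    return (f"We are playing Go on a {size}x{size} board. "
            f"Moves so far: {history or '(none)'}. "
            f"You are White. Answer with one coordinate like E5, nothing else.")
```

A driver loop would call something like `ask_llm(build_prompt(moves))`, run `parse_move` on the reply, and re-prompt on failure – which, per the anecdotes above, happens within a handful of moves.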

Community posts corroborate these issues. A Reddit user (Counting_Zenist) who ran a series of 9×9 games against GPT-4 reported that the model’s “reflections” and reasoning often broke down – it would miscount liberties, place stones on occupied points, and generally require constant correction until the game became unplayable (Ask questions over Sensei's Library ChatGPT-style - Page 5 - Go Resources - Online Go Forum). Even with carefully structured prompts and reminders, the LLM eventually forgets earlier moves due to context length limits or internal state errors. Another Go player recounts handily beating GPT-4 in a game of Go and noted that while the AI could chat about the moves, its actual play was very weak (I played chess against ChatGPT-4 and lost | Hacker News). In a discussion on Hacker News, users concluded that GPT-4 can be “trivially defeated by any human after they grasp the rules” (I played chess against ChatGPT-4 and lost | Hacker News), highlighting that its move choices are basically at beginner/random level once out of well-known opening patterns.

Despite these failures in gameplay, some anecdotes highlight the conversational strengths of LLMs. The same user who beat GPT-4 noted that the model was able to discuss the game reasonably afterwards – for instance, it could describe what each side was trying to do in broad terms and express ideas like “that move aimed to secure the corner.” In one Substack blog, an enthusiast shares GPT-4’s commentary during their game, remarking that “it plays pretty poorly, but it can talk about the game like it knows what’s happening.” In other cases, people have prompted ChatGPT to review professional Go games or answer Go trivia. The responses tend to be fluently written and sprinkled with Go jargon (influence, territory, joseki, etc.), but the accuracy can be hit-or-miss. On the LifeIn19x19 forum, members shared ChatGPT-generated commentaries on pro games; the text was impressively flowery and coherent – for example, describing a semi-final match with strategic highlights and even quoting an imaginary proverb (ChatGPT about Go • Life In 19x19). However, it was unclear if the AI correctly identified the real mistakes and turning points, or if it was mainly weaving a plausible-sounding narrative. Users noted that ChatGPT’s analysis often sounded generic or incorrectly assigned significance to moves, indicating it wasn’t truly “reading” the game like an expert, but rather imitating the style of Go commentaries it had seen in training data.

In summary, informal experiments uniformly show that without special aid, general LLMs cannot play Go competently. They tend to break the rules or make obviously bad moves soon into the game. Yet, those same models can produce running commentary that at least superficially resembles what a Go commentator or teacher might say. This has led some in the Go community to experiment with using LLMs as a talkative companion – for example, explaining rules or giving high-level advice to beginners – while being fully aware that any deep tactical insight from the LLM is unreliable.

Observed Playing Strength

All evidence to date places the playing strength of unaugmented LLMs at the absolute beginner level in Go. GPT-4, despite its prowess in other domains, does not come close to the expertise of even a casual human Go player. Observers have variously compared its skill to “random moves” or a novice who knows the rules but has no training. In the opening few moves, an LLM might choose moves that look reasonable (e.g. opening on 4-4 points or approaching a corner) – likely because such patterns appear in its training data. But as the game progresses, the lack of an internal game-state representation and planning becomes evident. The model may start playing incoherent moves that no human would ever play, or repeat moves (as it forgets a stone is already at a location). Illegal moves (placing a stone on an occupied point, immediately retaking a ko, or playing a suicidal move) eventually crop up if the game continues long enough (Ask questions over Sensei's Library ChatGPT-style - Page 5 - Go Resources - Online Go Forum). One user humorously remarked that GPT-4 on 9×9 at least didn’t immediately collapse, but it “played very poorly” and by about move 7 had completely lost the position (I played chess against ChatGPT-4 and lost | Hacker News). On 19×19, even reaching move 5 without issue was rare (I played chess against ChatGPT-4 and lost | Hacker News).

To put this in Go ranking terms: a brand-new human player is around 30 kyu in strength. GPT-4’s level might be around that ballpark or worse – essentially unranked. It does not consistently beat pure random play bots, because while it may do better than random in the very early moves (avoiding trivial blunders like filling in its own eyes, perhaps), it soon self-destructs by failing to respond meaningfully to the opponent. In fact, a human playing randomly but following the rules might exploit GPT’s mistakes easily once the model starts erring. No LLM tested has demonstrated the ability to play a valid full 19×19 game of Go, let alone a strong game. By contrast, in chess (a much simpler state space), GPT-4 can often sustain legal play for 20-30 moves and roughly achieve the strength of a weak club player (~1000 Elo) under optimal conditions (I played chess against ChatGPT-4 and lost | Hacker News). Go is exponentially more complex, and the observed behavior is that LLMs fail much sooner in the game. As a result, we can confidently say current general-purpose LLMs have negligible playing strength in Go – they lack the specialized reasoning or search that even a low-tier Go AI or a trained human beginner would apply.

Commentary and Position Understanding

While LLMs fall flat at choosing good moves, they excel at talking about the game in a general sense. These models have ingested large amounts of text, which likely include Go books, articles, forum discussions (e.g. Sensei’s Library pages, Reddit posts about Go, etc.), and even professional game commentaries. Thus, an LLM like GPT-4 can recall or recombine all sorts of Go knowledge: famous proverbs (“take care of the corners first”), common strategy tips (influence vs territory, thick vs thin shapes), and historical anecdotes. Users have found that if you ask an LLM a question like “What is the strategy behind the Kobayashi opening in Go?” or “Explain why move A is bigger than move B in this position,” the model will produce a seemingly thoughtful answer drawing on known principles. For instance, one person asked ChatGPT to write a commentary on a particular high-level game; the output was a very coherent narrative describing the flow of opening, middle, and endgame, highlighting purported mistakes and good moves by each player (ChatGPT about Go • Life In 19x19). The style and terminology were spot-on – it read like a real commentary – but it’s important to note that the LLM wasn’t truly analyzing the game in a technical sense. It was more “playing back” patterns of commentary it had seen, rather than computing which moves were actually mistakes. This leads to the risk of hallucinated insights: the model might say “Black’s move at 77 was an error allowing White to seize the initiative” (ChatGPT about Go • Life In 19x19) even if, in reality or according to an engine, that move was perfectly fine. Without an actual game engine’s judgment, the LLM has no reliable way to know which moves are objectively good or bad – it’s basing commentary on plausible-sounding heuristics and the overall narrative it constructed.

On simpler position understanding tasks, LLMs show mixed results. If you describe a Go position in words or give the list of moves so far, ChatGPT can sometimes identify obvious features (like “White has a weak group on the top side” or “Black has a big territory framework on the left”). It knows the vocabulary of Go strategy and can apply it in a broad-strokes manner. The strength of LLMs here is their ability to communicate – they can explain concepts or rationale in plain language, something traditional Go AI doesn’t do out-of-the-box. Early experiments where GPT-4 was asked to comment on each move of a human-vs-human game showed that it produces “garden-variety commentary” – meaning it makes generic observations about the moves (e.g. “Black approaches the corner, aiming to secure territory”) which are sometimes reasonable (I played chess against ChatGPT-4 and lost | Hacker News). As the moves became more complex, however, the commentary often devolved into nonsense or inaccuracies (I played chess against ChatGPT-4 and lost | Hacker News), because the model lost track of the actual board state and just continued spouting truisms. In one case, GPT-4 on a 9×9 game started praising moves or calling them mistakes even though it had the board state wrong by that point – illustrating that its “insightful commentary” was largely untethered from reality after a while.

That said, when not pushed beyond its limits, an LLM adds value in explaining Go ideas. Many users have leveraged ChatGPT as a teaching tool: asking it questions about rules, Go terminology, or even to create little quizzes and puzzles. For example, you can ask it to explain why corners are important, or to elaborate on the concept of sente and gote, and it will produce a clear explanation (drawn from Go literature it was trained on). It may even generate basic Go problems or imaginative analogies (like comparing a Go strategy to warfare or painting) which can be fun and educational. The key is that its knowledge is broad but shallow – it lacks the on-board precision that a true Go analyst has. A common observation is that LLM commentary “sounds” expert and is linguistically fluent, but you have to be careful trusting it on specifics. Without verification, an LLM might mis-evaluate life-and-death situations or suggest an obviously bad move as a good one, simply because it doesn’t actually see the position – it only predicts likely-sounding advice.

Comparison to Go Engines

Go engines (like AlphaGo, Leela Zero, KataGo, etc.) are specialized AI systems that have famously achieved superhuman play in Go. They work very differently from LLMs: engines use a combination of deep neural networks (trained via millions of self-play games) and tree search to evaluate positions and find optimal moves. In terms of playing strength, any of these engines (even run on modest hardware) will crush a general LLM in Go without breaking a sweat. Even the weakest publicly available Go AI (say, one on a phone app) is orders of magnitude stronger than GPT-4 when it comes to playing the game. LLMs simply do not have the exhaustive search or the refined value function that Go engines have developed. This was evident in all experiments – where engines play near-perfect legal moves, GPT-based play was chaotic and illegal by comparison (I played chess against ChatGPT-4 and lost | Hacker News). So, for head-to-head playing ability, there is no contest: Go engines are at superhuman-pro level, LLMs are at beginner level (or effectively unranked).

However, when it comes to commentary and explanation, the roles reverse. A typical Go engine can output a lot of data about a position – e.g. “win rate now 67% for Black”, “best move at E4, variation shows sequence …” – but it does not speak in human-friendly terms. It won’t tell you why a move is good in plain language or what strategy each player is following (at least not without additional tools). This is where LLMs shine: they can take information and put it into natural language narratives. The combination of these technologies is promising. In fact, some community developers have already built systems where an LLM is hooked up to a Go engine as a backend tool (Ask questions over Sensei's Library ChatGPT-style - Page 5 - Go Resources - Online Go Forum). In such a setup, the LLM can query the engine for the best move or an evaluation, then explain the result conversationally. This approach effectively masks the LLM’s weaknesses (since it no longer has to figure out the move itself) and utilizes its strength in communication. For example, an LLM-powered Go tutor might ask KataGo for analysis of a user’s game, then translate that into lessons: “KataGo thinks your move at D14 was a mistake because it left your group unsettled; a better move would have been at C13 to secure two eyes.” The LLM can produce this kind of insightful commentary if it’s fed the right facts from a reliable source (the engine). We saw a similar approach in the chess domain, where LLMs were used to generate commentary using concepts provided by a chess engine.
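The engine-as-backend setup can be sketched concretely. KataGo ships an analysis mode (`katago analysis ...`) that reads one JSON query per line on stdin and writes JSON results; the field names below (`moves`, `analyzeTurns`, `moveInfos`, `rootInfo`, etc.) follow that protocol as I understand it, but treat them as assumptions to check against your KataGo version's documentation. The second function turns an engine response into plain language that can be pasted into an LLM prompt, so the model explains rather than guesses.

```python
import json

def make_query(moves, query_id="q1", size=19, komi=7.5, visits=200):
    """Build one JSON query line for KataGo's analysis engine.
    `moves` is a list like [["B", "Q16"], ["W", "D4"]]."""
    return json.dumps({
        "id": query_id,
        "moves": moves,
        "rules": "japanese",
        "komi": komi,
        "boardXSize": size,
        "boardYSize": size,
        "analyzeTurns": [len(moves)],  # analyze the final position
        "maxVisits": visits,
    })

def summarize_for_llm(response):
    """Condense an analysis response into one factual sentence for an
    LLM prompt. Picks the engine's top-ranked candidate move."""
    best = min(response["moveInfos"], key=lambda m: m["order"])
    root = response["rootInfo"]
    return (f"Engine: the side to move wins {root['winrate']:.0%} of the time; "
            f"best move is {best['move']} "
            f"(score lead {best['scoreLead']:+.1f} points).")
```

A wrapper would pipe `make_query(...)` into the KataGo subprocess, read back a line of JSON, and prepend `summarize_for_llm(...)` to the user's question before sending it to the chat model – the LLM then only has to verbalize facts it was given.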

Comparatively, if an LLM is not given engine assistance, its commentary may be more entertaining or easier to read than an engine’s raw data, but it’s also far less trustworthy. It might be fine for casual discussion or reviewing broad themes of a game, but any serious Go analysis today still relies on the dedicated engines. In essence, LLMs complement rather than rival Go engines. The engine provides playing strength and accurate analysis, and the LLM can provide explanation and a conversational interface. Many Go players see potential in this synergy – for instance, integrating an LLM into online Go servers as a kibitzing commentator or a teaching tool that can answer questions about a position (sourced from an engine’s analysis). But as a standalone, an LLM cannot match the depth of understanding that even older Go engines have achieved.

Future Prospects

Going forward, one big question is whether newer AI models will narrow the gap in Go performance without simply invoking an external engine. There are a few trends suggesting that future multimodal or hybrid models might do better. DeepMind’s CEO Demis Hassabis has hinted that their upcoming models (e.g. Gemini) will combine techniques from AlphaGo with large language models. In his words, Gemini is envisioned as “combining some of the strengths of AlphaGo-type systems with the amazing language capabilities of the large models” (Demis Hassabis: "At a high level you can think of Gemini as combining some of the strengths of AlphaGo-type systems with the amazing language capabilities of the large models. We also have some new innovations that are going to be pretty interesting." : r/mlscaling). This implies that future general AI might internally incorporate reinforcement learning or planning modules that were pioneered in Go engines. If that happens, we could see an LLM that actually understands Go at a strategic level and can both play strongly and discuss the game. For example, a Gemini-like model might use an internal simulation (self-play thinking) to choose a Go move, then use its language module to explain that move. Such a model could potentially achieve much higher playing strength than today’s GPT-4, perhaps even challenging traditional engines, while preserving the ability to output natural language commentary. As of early 2025, this is speculative, but it’s a clear direction that major AI research is exploring.

In the nearer term, improvements to LLMs’ context length and training might yield incremental gains. One current limitation is that ChatGPT’s memory of the game (the prompt context) is limited – longer context windows (like GPT-4’s 32k token version) could allow it to “see” the entire game record longer and reduce some forgetting. Fine-tuning is another avenue: if one were to fine-tune an LLM on a large corpus of Go game records and commentaries, the model might learn to produce more accurate move sequences and better-informed commentary. There have been small-scale projects in other games (for instance, fine-tuning a smaller LLM to play chess at an intermediate level by feeding it thousands of games). A similar fine-tune for Go could train the model to follow the rules strictly and mimic human-like moves up to a certain strength. However, due to Go’s complexity, a purely supervised learning approach (learning from human games) has limits – even strong amateurs are far weaker than top engines, and the model would likely plateau at the level of training data (this was seen historically with pre-AlphaGo Go AI that trained on human data). To really reach expert play, reinforcement learning or search (as used by AlphaGo) seems necessary. Incorporating those into an LLM is an active research challenge.
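A supervised fine-tune of the kind described would begin with data preparation along these lines: turning SGF game records into (history → next move) training pairs. This is a deliberately naive sketch – the regex parser handles only the main line of a 19×19 game and skips passes, variations, and handicap stones.

```python
import re

# SGF stores moves as letter pairs like "pd"; display columns skip "I".
COLS = "ABCDEFGHJKLMNOPQRST"

def sgf_moves(sgf_text):
    """Extract main-line moves from an SGF record as ('B'|'W', 'Q16')
    pairs. SGF y-coordinates grow downward, so row = 19 - y."""
    out = []
    for color, coord in re.findall(r";([BW])\[([a-s]{2})\]", sgf_text):
        x = ord(coord[0]) - ord("a")
        y = ord(coord[1]) - ord("a")
        out.append((color, f"{COLS[x]}{19 - y}"))
    return out

def training_pairs(sgf_text):
    """One (prompt, target) example per move: the history so far maps
    to the move a human actually played next."""
    moves = sgf_moves(sgf_text)
    pairs = []
    for i in range(len(moves)):
        history = "; ".join(f"{c} {m}" for c, m in moves[:i])
        pairs.append((f"Moves so far: {history or '(none)'}. Next move?",
                      f"{moves[i][0]} {moves[i][1]}"))
    return pairs
```

As the paragraph notes, a model trained this way would at best imitate the players in its corpus – the data pipeline caps the ceiling at human-game strength.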

On the commentary side, we can expect LLMs to become even more useful. As more Go literature and possibly annotated game data become available, an LLM could be fine-tuned or prompted to give more precise commentary. We might see specialized LLM-based Go coaches that can analyze your game (with an engine’s help) and then give you a lesson, much like a human teacher would. This could include pointing out mistakes, suggesting alternatives, and answering “what if” questions in a conversational manner. In fact, even with current tech, some users have manually done this: they run a KataGo analysis on a game, then feed the key points to ChatGPT to formulate a human-readable report. Automating that process with a fine-tuned model is a foreseeable development.

In conclusion, as of 2025 general-purpose LLMs are no match for Go engines in playing strength, and they have only a surface-level “understanding” of Go positions. Their value lies in communication – they can talk about Go in ways that are engaging and accessible. With ongoing research, we anticipate a new generation of AI systems that will blend the raw power of Go engines with the versatility of LLMs. When that happens, we might finally have an AI that not only beats the world champion (which has already been done) but can sit down afterward and explain the game to the world champion in plain language. Until then, LLMs playing Go on their own remain more of a curiosity and a teaching aid than a competitive force. The trend is that each iteration (GPT-4, Claude, and presumably Gemini and beyond) adds a bit more capability – so we’ll watch keenly whether the gap closes in the coming years, or if Go remains a domain where specialized engines reign supreme and LLMs serve as supportive commentators.

Sources: Published research on combining LLMs with game-playing AI ([2403.05632] Can Large Language Models Play Games? A Case Study of A Self-Play Approach); community experiments with GPT-4 playing Go and providing commentary (I played chess against ChatGPT-4 and lost | Hacker News); examples of ChatGPT-generated game commentary (ChatGPT about Go • Life In 19x19); discussions on future models like Gemini integrating AlphaGo techniques (Demis Hassabis: "At a high level you can think of Gemini as combining some of the strengths of AlphaGo-type systems with the amazing language capabilities of the large models. We also have some new innovations that are going to be pretty interesting." : r/mlscaling).